Supervised Learning Classification Project: AllLife Bank Personal
Loan Campaign
Problem Statement
Context
AllLife Bank is a US bank with a growing customer base. The majority of these customers are liability customers (depositors) with
varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested
in expanding this base rapidly to bring in more loan business and, in the process, earn more through interest on loans. In particular,
the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as
depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged
the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers
who have a higher probability of purchasing the loan.
Objective
To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving
purchases, and to identify which segments of customers to target.
Data Dictionary
ID : Customer ID
Age : Customer's age in completed years
Experience : Number of years of professional experience
Income : Annual income of the customer (in thousand dollars)
ZIPCode : Home address ZIP code
Family : Family size of the customer
CCAvg : Average spending on credit cards per month (in thousand dollars)
Education : Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage : Value of house mortgage, if any (in thousand dollars)
Personal_Loan : Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
Securities_Account : Does the customer have a securities account with the bank? (0: No, 1: Yes)
CD_Account : Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
Online : Does the customer use internet banking facilities? (0: No, 1: Yes)
CreditCard : Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
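The dictionary above can double as a lightweight schema check before any analysis. A minimal sketch, assuming the CSV's column names match the dictionary (the `EXPECTED_COLUMNS` list and `check_schema` helper are illustrative, not part of the original notebook):

```python
import pandas as pd

# Expected columns from the data dictionary; "ZIPCode" is unspaced in the file.
EXPECTED_COLUMNS = [
    "ID", "Age", "Experience", "Income", "ZIPCode", "Family", "CCAvg",
    "Education", "Mortgage", "Personal_Loan", "Securities_Account",
    "CD_Account", "Online", "CreditCard",
]

def check_schema(df: pd.DataFrame) -> list:
    """Return the list of expected columns missing from df."""
    return [c for c in EXPECTED_COLUMNS if c not in df.columns]

# Tiny synthetic frame standing in for the real CSV
demo = pd.DataFrame({c: [0] for c in EXPECTED_COLUMNS})
print(check_schema(demo))  # []
```

Running this right after `pd.read_csv` catches renamed or dropped columns before they surface as confusing `KeyError`s downstream.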
Importing necessary libraries
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
# had to replace plot_confusion_matrix with ConfusionMatrixDisplay
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)
Loading the dataset
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Data Overview
Observations
Sanity checks
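Typical sanity checks for this dataset are duplicated rows, missing cells, and impossible values (the summary statistics later show `Experience` has a minimum of -3). A hedged sketch on a synthetic stand-in frame; in the notebook these would run on `data` after loading the CSV:

```python
import pandas as pd

# Synthetic stand-in for the loan data (column names as in the real file)
demo = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [25, 45, 39],
    "Experience": [1, -3, 15],   # the real data also contains negative values
    "Personal_Loan": [0, 0, 1],
})

n_duplicates = demo.duplicated().sum()       # fully duplicated rows
n_missing = demo.isnull().sum().sum()        # missing cells anywhere
n_neg_exp = (demo["Experience"] < 0).sum()   # impossible negative experience

print(n_duplicates, n_missing, n_neg_exp)  # 0 0 1
```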
   ID  Age  Experience  Income  ZIPCode  Family  CCAvg  Education  Mortgage  Personal_Loan  Securities_Account  CD_Account  Online
0   1   25           1      49    91107       4    1.6          1         0              0                   1           0       0
1   2   45          19      34    90089       3    1.5          1         0              0                   1           0       0
2   3   39          15      11    94720       1    1.0          1         0              0                   0           0       0
3   4   35           9     100    94112       1    2.7          2         0              0                   0           0       0
4   5   35           8      45    91330       4    1.0          2         0              0                   0           0       0
(the CreditCard column was cut off at the right edge of the export)
        ID  Age  Experience  Income  ZIPCode  Family  CCAvg  Education  Mortgage  Personal_Loan  Securities_Account  CD_Account
4995  4996   29           3      40    92697       1    1.9          3         0              0                   0           0
4996  4997   30           4      15    92037       4    0.4          1        85              0                   0           0
4997  4998   63          39      24    93023       2    0.3          3         0              0                   0           0
4998  4999   65          40      49    90034       3    0.5          2         0              0                   0           0
4999  5000   28           4      83    92612       3    0.8          1         0              0                   0           0
(the Online and CreditCard columns were cut off at the right edge of the export)
(5000, 14)
from google.colab import drive
drive.mount('/content/drive')
Loan = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/PGP_in_AI-ML_@UT_Austin-Docs/Module_2/Machine_Learni
data = Loan.copy()
data.head()
data.tail()
data.shape
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 5000 non-null int64
1 Age 5000 non-null int64
2 Experience 5000 non-null int64
3 Income 5000 non-null int64
4 ZIPCode 5000 non-null int64
5 Family 5000 non-null int64
6 CCAvg 5000 non-null float64
7 Education 5000 non-null int64
8 Mortgage 5000 non-null int64
9 Personal_Loan 5000 non-null int64
10 Securities_Account 5000 non-null int64
11 CD_Account 5000 non-null int64
12 Online 5000 non-null int64
13 CreditCard 5000 non-null int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
                     count          mean          std      min       25%      50%       75%      max
ID                  5000.0   2500.500000  1443.520003      1.0  1250.75   2500.5  3750.25   5000.0
Age                 5000.0     45.338400    11.463166     23.0    35.00     45.0    55.00     67.0
Experience          5000.0     20.104600    11.467954     -3.0    10.00     20.0    30.00     43.0
Income              5000.0     73.774200    46.033729      8.0    39.00     64.0    98.00    224.0
ZIPCode             5000.0  93169.257000  1759.455086  90005.0  91911.00  93437.0  94608.00  96651.0
Family              5000.0      2.396400     1.147663      1.0     1.00      2.0     3.00      4.0
CCAvg               5000.0      1.937938     1.747659      0.0     0.70      1.5     2.50     10.0
Education           5000.0      1.881000     0.839869      1.0     1.00      2.0     3.00      3.0
Mortgage            5000.0     56.498800   101.713802      0.0     0.00      0.0   101.00    635.0
Personal_Loan       5000.0      0.096000     0.294621      0.0     0.00      0.0     0.00      1.0
Securities_Account  5000.0      0.104400     0.305809      0.0     0.00      0.0     0.00      1.0
CD_Account          5000.0      0.060400     0.238250      0.0     0.00      0.0     0.00      1.0
Online              5000.0      0.596800     0.490589      0.0     0.00      1.0     1.00      1.0
CreditCard          5000.0      0.294000     0.455637      0.0     0.00      0.0     1.00      1.0
Exploratory Data Analysis
EDA is an important part of any project involving data.
It is important to investigate and understand the data better before building a model with it.
A few questions have been listed below which will help you approach the analysis in the right manner and generate insights
from the data.
A thorough analysis of the data, beyond the questions mentioned below, should also be done.
Questions:
1. What is the distribution of the mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
2. How many customers have credit cards?
3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
4. How does a customer's interest in purchasing a loan vary with their age?
5. How does a customer's interest in purchasing a loan vary with their education?
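Some of these questions reduce to one-liners once the data is loaded. As a sketch of question 2 (how many customers hold a credit card from another bank), shown here on a synthetic sample rather than the real `data` frame:

```python
import pandas as pd

# Synthetic stand-in; on the real data, CreditCard == 1 for about 29.4%
# of the 5,000 customers (see the describe() output above).
demo = pd.DataFrame({"CreditCard": [1, 0, 0, 1, 0]})
counts = demo["CreditCard"].value_counts()
share = demo["CreditCard"].mean()  # fraction of card holders
print(counts.to_dict(), share)  # {0: 3, 1: 2} 0.4
```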
data.describe().T
data = data.drop(columns=['ID'])
#This cell defines a helper that creates a combined boxplot and histogram for a specified feature in a DataFrame.
#The boxplot displays the summary statistics, while the histogram represents the distribution of the data.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="xkcd:pale purple"
    )
    if bins:
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, color="green"
        )
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")
#labeled_barplot generates a bar plot for a specified column of a pandas dataframe, showing either the counts
#or the percentage of each category. The function uses a customized color palette,
#rotates the x-axis labels for better readability,
#sets the figure size dynamically based on the number of unique values, and annotates each bar with its value.
def labeled_barplot(data, feature, perc=False, n=None):
    total = len(data[feature])
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(100 * p.get_height() / total)
        else:
            label = p.get_height()
        x = p.get_x() + p.get_width() / 2
        y = p.get_height()
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )
    plt.show()
#This function generates a combined histogram and boxplot visualization for the "Age" column in the
#data frame, to facilitate the statistical analysis of the age distribution in the dataset.
histogram_boxplot(data, "Age")
#using Seaborn and Matplotlib libraries to create a vertical 2-row subplot:
#the first row contains a histogram with a KDE (Kernel Density Estimate) plot overlay representing
#the distribution of values in the 'Experience' column of the 'data' DataFrame, and the second row
#contains a boxplot of the same data to give insights into the statistical properties of the 'Experience'
#column (such as the median, interquartile range, etc.).
sns.set_style("darkgrid")
#combo
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(7, 10))
#this is the histogram code
sns.histplot(data['Experience'], bins=10, kde=True, ax=axes[0])
axes[0].set_title('Histogram of Experience')
axes[0].set_xlabel('Years of Experience')
axes[0].set_ylabel('Frequency')
#this is the boxplot code
sns.boxplot(x=data['Experience'], ax=axes[1])
axes[1].set_title('Boxplot of Experience')
axes[1].set_xlabel('Years of Experience')
plt.tight_layout()
plt.show()
#Same data as above is analyzed again, shown as a combined histogram and box plot; this same helper is reused for the remaining numeric columns.
histogram_boxplot(data, "Experience")
#This function generates a combined histogram and boxplot visualization for the "Income" column in the
#data frame, to facilitate the statistical analysis of the income distribution in the dataset.
histogram_boxplot(data, "Income")
#This function generates a combined histogram and boxplot visualization for the "CCAvg." column in the
#data frame, to facilitate the statistical analysis of the CCAvg distribution in the dataset.
histogram_boxplot(data, "CCAvg")
#This function generates a combined histogram and boxplot visualization for the "Mortgage" column in the
#data frame, to facilitate the statistical analysis of the Mortgage distribution in the dataset.
histogram_boxplot(data, "Mortgage")
#labeled_barplot visualizes the distribution of the "Family" column; perc=True annotates each bar
#with its percentage of the total instead of the raw count.
labeled_barplot(data, "Family", perc=True)
#Distribution of the "Education" column, annotated with percentages.
labeled_barplot(data, "Education", perc=True)
#Distribution of the "Securities_Account" column, annotated with counts.
labeled_barplot(data, "Securities_Account")
#Distribution of the "CD_Account" column, annotated with counts.
labeled_barplot(data, "CD_Account")
#Distribution of the "Online" column, annotated with counts.
labeled_barplot(data, "Online")
#Distribution of the "CreditCard" column, annotated with counts.
labeled_barplot(data, "CreditCard")
#There are too many ZIP codes to read at full resolution; plotting all of them here, then the top 10 in the next cell.
zip_counts = data['ZIPCode'].value_counts()
plt.figure(figsize=(10,6))
barplot = sns.barplot(x=zip_counts.index, y=zip_counts.values, palette='viridis')
for p in barplot.patches:
    barplot.annotate(format(p.get_height(), '.0f'),
                     (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center', va='center',
                     xytext=(0, 9),
                     textcoords='offset points')
plt.title('Frequency of ZIP Codes')
plt.ylabel('Frequency')
plt.xlabel('ZIP Code')
plt.xticks(rotation=45)
plt.show()
top_10_zip_counts = data['ZIPCode'].value_counts().head(10)
#Using barplot to visualize the top 10 most frequent ZIP codes in the top_10_zip_counts data.
#The bars represent different ZIP codes (on the X-axis) and their respective frequencies (on the Y-axis);
#each bar is annotated with its exact frequency value at the top center, and the X-axis labels are rotated by 45 degrees.
plt.figure(figsize=(12,7))
barplot = sns.barplot(x=top_10_zip_counts.index, y=top_10_zip_counts.values, palette='viridis')
for p in barplot.patches:
    barplot.annotate(format(p.get_height(), '.0f'),
                     (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center', va='center',
                     xytext=(0, 9),
                     textcoords='offset points')
plt.title('Top 10 Most Frequent ZIP Codes')
plt.ylabel('Frequency')
plt.xlabel('ZIP Code')
plt.xticks(rotation=45)
plt.show()
#stacked_barplot creates two cross-tabulations (contingency tables): one displaying the counts of each
#combination of the predictor and target variables (tab1) and the other displaying these counts as proportions.
def stacked_barplot(data, predictor, target):
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
#distribution_plot_wrt_target generates a 2x2 grid of plots to visually analyze the distribution of a predictor
#variable with respect to the two unique target variable values.
#It uses histograms with KDE for the individual target values and box plots to compare the predictor variable across classes.
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
Personal_Loan     0    1   All
Education
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
#displays a heatmap of the data
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
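The heatmap can also be read numerically: sort every numeric feature by its correlation with the target. A hedged sketch on a small synthetic frame (on the real data, Income, CCAvg, and CD_Account show the strongest positive correlations with Personal_Loan):

```python
import pandas as pd

# Synthetic stand-in with two features and a binary target
demo = pd.DataFrame({
    "Income": [49, 34, 100, 180, 45],
    "CCAvg": [1.6, 1.5, 2.7, 8.9, 1.0],
    "Personal_Loan": [0, 0, 0, 1, 0],
})
corr_with_target = (
    demo.corr()["Personal_Loan"].drop("Personal_Loan").sort_values(ascending=False)
)
print(corr_with_target.index.tolist())
```

This ranking answers question 3 directly without squinting at the heatmap's color scale.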
#This line of code calls the stacked_barplot function with "Education" as the predictor variable
#and "Personal_Loan" as the target variable to create a stacked bar plot visualizing the distribution
#of personal loan acceptance across different education levels using the data from the data dataframe.
stacked_barplot(data, "Education", "Personal_Loan")
#This cell creates and displays a stacked bar plot that illustrates the distribution
#of personal loans across different family sizes, using data grouped by the "Family" and "Personal_Loan" columns.
pivot_data = data.groupby(['Family', 'Personal_Loan']).size().unstack()
pivot_data.plot(kind='bar', stacked=True, figsize=(10, 7))
plt.title("Stacked Barplot of Personal Loan by Family")
plt.ylabel("Count")
plt.xlabel("Family Size/Type")
plt.show()
Personal_Loan          0    1   All
Securities_Account
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------
CD_Account        0    1   All
Personal_Loan
All            4698  302  5000
0              4358  162  4520
1               340  140   480
------------------------------------------------------------------------------------------------------------------------
#This line creates a stacked bar plot visualizing the relationship between "Securities_Account" and "Personal_Loan".
stacked_barplot(data, "Securities_Account", "Personal_Loan")
#This line creates a stacked bar plot visualizing the relationship between "CD_Account" and "Personal_Loan"
#(here "Personal_Loan" is passed as the predictor, so it forms the bars).
stacked_barplot(data, "Personal_Loan", "CD_Account")
Online            0     1   All
Personal_Loan
All            2016  2984  5000
0              1827  2693  4520
1               189   291   480
------------------------------------------------------------------------------------------------------------------------
CreditCard        0     1   All
Personal_Loan
All            3530  1470  5000
0              3193  1327  4520
1               337   143   480
------------------------------------------------------------------------------------------------------------------------
#This line creates a stacked bar plot visualizing the relationship between "Online" and "Personal_Loan".
stacked_barplot(data, "Personal_Loan", "Online")
#This line creates a stacked bar plot visualizing the relationship between "CreditCard" and "Personal_Loan".
stacked_barplot(data, "Personal_Loan", "CreditCard")
#TOO MANY ZIP CODES TO PLOT, USING TOP 10. This version of stacked_barplot takes a dataframe, two column names
#(target and category), and an optional parameter specifying the number of top categories to consider, creating a
#stacked bar plot of the target column categories across the top N categories (plus an 'Other' category for the rest);
#finally, it cleans up by removing the additional column it created for aggregation.
def stacked_barplot(data, target_col, category_col, top_n=10):
    # Get the top N categories
    top_categories = data[category_col].value_counts().head(top_n).index.tolist()
    data['agg_category'] = data[category_col].apply(lambda x: x if x in top_categories else "Other")
    pivot_data = data.groupby(['agg_category', target_col]).size().unstack()
    pivot_data.sort_values(by=1, ascending=False).plot(kind='bar', stacked=True, figsize=(12, 7))
    plt.title(f"Stacked Barplot of {target_col} by Top {top_n} {category_col}")
    plt.ylabel("Count")
    plt.xlabel(category_col)
    plt.xticks(rotation=45)
    plt.show()
    data.drop(columns='agg_category', inplace=True)
#This line creates a stacked bar plot visualizing the relationship between the "ZIPCode" and "Personal_Loan" columns.
stacked_barplot(data, 'Personal_Loan', 'ZIPCode')
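The ZIPCode column has hundreds of distinct levels, which is also why grouping to the top 10 was needed here. A common alternative hedge (not applied in this notebook) is to coarsen each ZIP code to its first two digits, i.e. a broad geographic area, before plotting or encoding. A sketch on a few example codes from the data:

```python
import pandas as pd

# Coarsen 5-digit ZIP codes to their 2-digit prefix (broad region)
zips = pd.Series([91107, 90089, 94720, 94112, 91330], name="ZIPCode")
zip_region = zips.astype(str).str[:2]
print(zip_region.tolist())  # ['91', '90', '94', '94', '91']
```

This would shrink the later one-hot encoding from hundreds of dummy columns to a handful.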
#This line plots the distribution of ages for those with and without a personal loan, along with box plots.
distribution_plot_wrt_target(data, "Age", "Personal_Loan")
#distribution_plot_wrt_target is called with "Experience" as the predictor and "Personal_Loan" as the target,
#visualizing the distribution of "Experience" values for the two "Personal_Loan" classes through histograms and box plots.
distribution_plot_wrt_target(data, "Experience", "Personal_Loan")
#distribution_plot_wrt_target is called with "Income" as the predictor and "Personal_Loan" as the target,
#visualizing the distribution of "Income" values for the two "Personal_Loan" classes through histograms and box plots.
distribution_plot_wrt_target(data, "Income", "Personal_Loan")
#distribution_plot_wrt_target plots the distribution of "CCAvg" (average monthly credit card spending)
#with respect to the "Personal_Loan" target variable, showcasing the relationship between
#credit card spending and personal loan status.
distribution_plot_wrt_target(data, "CCAvg", "Personal_Loan")
Data Preprocessing
Missing value treatment
Feature engineering (if needed)
Outlier detection and treatment (if needed)
Preparing data for modeling
Any other preprocessing steps (if needed)
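One concrete issue the summary statistics surfaced is that Experience has a minimum of -3, which is not a valid number of years. This notebook handles it by dropping the column before modeling; an alternative minimal treatment, sketched here on synthetic values (column name as in the data), is to clip negatives to zero:

```python
import pandas as pd

# Clip impossible negative Experience values to zero
demo = pd.DataFrame({"Experience": [-3, 0, 10, 43]})
demo["Experience"] = demo["Experience"].clip(lower=0)
print(demo["Experience"].tolist())  # [0, 0, 10, 43]
```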
Age 0.00
Experience 0.00
Income 1.92
ZIPCode 0.00
Family 0.00
CCAvg 6.48
Education 0.00
Mortgage 5.82
Personal_Loan 9.60
Securities_Account 10.44
CD_Account 6.04
Online 0.00
CreditCard 0.00
dtype: float64
#These lines flag outliers with the 1.5*IQR rule and report the percentage of outliers per column
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
((data.select_dtypes(include=["float64", "int64"]) < lower)
|(data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
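The cell above only reports outlier percentages; tree-based models are largely insensitive to outliers, so no treatment is applied here. If treatment were wanted, one possibility is winsorizing a column to the same IQR fences. A hedged sketch on a stand-in series (not part of the notebook's pipeline):

```python
import pandas as pd

# Cap a numeric column at its IQR fences (winsorization sketch)
s = pd.Series([1.0, 1.5, 2.0, 2.5, 10.0])  # stand-in for e.g. CCAvg
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = s.clip(lower=lower, upper=upper)
print(capped.tolist())  # [1.0, 1.5, 2.0, 2.5, 4.0]
```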
#creating a feature matrix "X" by dropping the "Personal_Loan" and "Experience" columns from the data, and a target vector "Y" from the "Personal_Loan" column
Shape of Training set : (3500, 477)
Shape of test set : (1500, 477)
Percentage of classes in training set:
0 0.905429
1 0.094571
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
0 0.900667
1 0.099333
Name: Personal_Loan, dtype: float64
Model Building
Model Evaluation Criterion
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
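For this business problem, recall on the positive class matters most: a false negative is a likely purchaser the campaign never contacts, while a false positive only costs a marketing touch. Since `make_scorer` is already imported, a recall-focused scorer can be built for a later hyperparameter search. A minimal sketch with toy labels:

```python
from sklearn.metrics import make_scorer, recall_score

# Recall-focused scorer, usable as `scoring=` in GridSearchCV;
# pos_label=1 marks loan acceptors as the positive class.
recall_scorer = make_scorer(recall_score, pos_label=1)

# Quick check on toy labels: 2 of 3 actual positives are caught -> recall 2/3
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
print(round(recall_score(y_true, y_pred, pos_label=1), 3))  # 0.667
```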
X = data.drop(["Personal_Loan", "Experience"], axis=1)
Y = data["Personal_Loan"]
#This line applies one-hot encoding to the "ZIPCode" and "Education" columns, dropping the first dummy of each to avoid redundancy
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30, random_state=1)
#This cell prints the shapes of the training and test datasets and the percentage distribution of the target classes in each
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
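With only about 9.6% positives, the split above preserves the class ratio only by chance. Passing `stratify=Y` would guarantee it. A hedged sketch on synthetic labels (the notebook's own split does not use stratification):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10% positive labels, stratified 70/30 split keeps the ratio exact
y = np.array([1] * 10 + [0] * 90)
X = np.arange(100).reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```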
#This cell defines a helper that computes accuracy, recall, precision, and F1, and plots a confusion matrix using seaborn's heatmap,
#visually representing the performance of the classification model on the data.
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix
def model_performance_and_confusion_matrix(model, predictors, target):
    pred = model.predict(predictors)  # pred is computed once and reused for both the metrics and the confusion matrix
    metrics = {
        "Accuracy": accuracy_score(target, pred),
        "Recall": recall_score(target, pred),
        "Precision": precision_score(target, pred),
        "F1": f1_score(target, pred),
    }
    df_perf = pd.DataFrame(metrics, index=[0])
    cm = confusion_matrix(target, pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    return df_perf  # returns the performance metrics DataFrame and plots the confusion matrix
#This cell initializes a Decision Tree classifier with the "gini" criterion and a fixed random state, and fits it on the training data
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
#This evaluates the model's performance using the training data, by calculating various metrics
#(accuracy, recall, precision, and F1 score) and visualizing the confusion matrix,
#then displaying it
model_performance_and_confusion_matrix(model, X_train, y_train)
DecisionTreeClassifier(random_state=1)
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Visualizing Decision Tree
#This line evaluates the decision tree model's performance on the training data and stores the performance metrics
decision_tree_perf_train = model_performance_and_confusion_matrix(model, X_train, y_train)
decision_tree_perf_train
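Perfect training scores across all four metrics are a classic sign of an overfit, fully grown tree. A natural next step (GridSearchCV is already imported above) is to search a small grid of complexity constraints, scoring on recall. A hedged sketch on synthetic data so the cell is self-contained; the grid values are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data standing in for the loan frame (~10% positives)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=1
)

# Search depth and leaf-size constraints, optimizing recall via 3-fold CV
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [3, 5, 7], "min_samples_leaf": [5, 10]},
    scoring="recall",
    cv=3,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

The pre-pruned tree from `grid.best_estimator_` would then be compared against the full tree on the held-out test set.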
#This line creates a list of feature names from the columns of the X_train dataframe, then prints it
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'Z
IPCode_90007', 'ZIPCode_90009', 'ZIPCode_90011', 'ZIPCode_90016', 'ZIPCode_90018', 'ZIPCode_90019', 'ZIPCode_90
024', 'ZIPCode_90025', 'ZIPCode_90027', 'ZIPCode_90028', 'ZIPCode_90029', 'ZIPCode_90032', 'ZIPCode_90033', 'ZI
PCode_90034', 'ZIPCode_90035', 'ZIPCode_90036', 'ZIPCode_90037', 'ZIPCode_90041', 'ZIPCode_90044', 'ZIPCode_900
45', 'ZIPCode_90048', 'ZIPCode_90049', 'ZIPCode_90057', 'ZIPCode_90058', 'ZIPCode_90059', 'ZIPCode_90064', 'ZIP
Code_90065', 'ZIPCode_90066', 'ZIPCode_90068', 'ZIPCode_90071', 'ZIPCode_90073', 'ZIPCode_90086', 'ZIPCode_9008
9', 'ZIPCode_90095', 'ZIPCode_90210', 'ZIPCode_90212', 'ZIPCode_90230', 'ZIPCode_90232', 'ZIPCode_90245', 'ZIPC
ode_90250', 'ZIPCode_90254', 'ZIPCode_90266', 'ZIPCode_90272', 'ZIPCode_90274', 'ZIPCode_90275', 'ZIPCode_90277
', 'ZIPCode_90280', 'ZIPCode_90291', 'ZIPCode_90304', 'ZIPCode_90401', 'ZIPCode_90404', 'ZIPCode_90405', 'ZIPCo
de_90502', 'ZIPCode_90503', 'ZIPCode_90504', 'ZIPCode_90505', 'ZIPCode_90509', 'ZIPCode_90601', 'ZIPCode_90623'
, 'ZIPCode_90630', 'ZIPCode_90638', 'ZIPCode_90639', 'ZIPCode_90640', 'ZIPCode_90650', 'ZIPCode_90717', 'ZIPCod
e_90720', 'ZIPCode_90740', 'ZIPCode_90745', 'ZIPCode_90747', 'ZIPCode_90755', 'ZIPCode_90813', 'ZIPCode_90840',
'ZIPCode_91006', 'ZIPCode_91007', 'ZIPCode_91016', 'ZIPCode_91024', 'ZIPCode_91030', 'ZIPCode_91040', 'ZIPCode_
91101', 'ZIPCode_91103', 'ZIPCode_91105', 'ZIPCode_91107', 'ZIPCode_91109', 'ZIPCode_91116', 'ZIPCode_91125', '
ZIPCode_91129', 'ZIPCode_91203', 'ZIPCode_91207', 'ZIPCode_91301', 'ZIPCode_91302', 'ZIPCode_91304', 'ZIPCode_9
1311', 'ZIPCode_91320', 'ZIPCode_91326', 'ZIPCode_91330', 'ZIPCode_91335', 'ZIPCode_91342', 'ZIPCode_91343', 'Z
IPCode_91345', 'ZIPCode_91355', 'ZIPCode_91360', 'ZIPCode_91361', 'ZIPCode_91365', 'ZIPCode_91367', 'ZIPCode_91
380', 'ZIPCode_91401', 'ZIPCode_91423', 'ZIPCode_91604', 'ZIPCode_91605', 'ZIPCode_91614', 'ZIPCode_91706', 'ZI
PCode_91709', 'ZIPCode_91710', 'ZIPCode_91711', 'ZIPCode_91730', 'ZIPCode_91741', 'ZIPCode_91745', 'ZIPCode_917
54', 'ZIPCode_91763', 'ZIPCode_91765', 'ZIPCode_91768', 'ZIPCode_91770', 'ZIPCode_91773', 'ZIPCode_91775', 'ZIP
Code_91784', 'ZIPCode_91791', 'ZIPCode_91801', 'ZIPCode_91902', 'ZIPCode_91910', 'ZIPCode_91911', 'ZIPCode_9194
1', 'ZIPCode_91942', 'ZIPCode_91950', 'ZIPCode_92007', 'ZIPCode_92008', 'ZIPCode_92009', 'ZIPCode_92024', 'ZIPC
ode_92028', 'ZIPCode_92029', 'ZIPCode_92037', 'ZIPCode_92038', 'ZIPCode_92054', 'ZIPCode_92056', 'ZIPCode_92064
', 'ZIPCode_92068', 'ZIPCode_92069', 'ZIPCode_92084', 'ZIPCode_92093', 'ZIPCode_92096', 'ZIPCode_92101', 'ZIPCo
de_92103', 'ZIPCode_92104', 'ZIPCode_92106', 'ZIPCode_92109', 'ZIPCode_92110', 'ZIPCode_92115', 'ZIPCode_92116'
, 'ZIPCode_92120', 'ZIPCode_92121', 'ZIPCode_92122', 'ZIPCode_92123', 'ZIPCode_92124', 'ZIPCode_92126', 'ZIPCod
e_92129', 'ZIPCode_92130', 'ZIPCode_92131', 'ZIPCode_92152', 'ZIPCode_92154', 'ZIPCode_92161', 'ZIPCode_92173',
'ZIPCode_92177', 'ZIPCode_92182', 'ZIPCode_92192', 'ZIPCode_92220', 'ZIPCode_92251', 'ZIPCode_92325', 'ZIPCode_
92333', 'ZIPCode_92346', 'ZIPCode_92350', 'ZIPCode_92354', 'ZIPCode_92373', 'ZIPCode_92374', 'ZIPCode_92399', '
ZIPCode_92407', 'ZIPCode_92507', 'ZIPCode_92518', 'ZIPCode_92521', 'ZIPCode_92606', 'ZIPCode_92612', 'ZIPCode_9
2614', 'ZIPCode_92624', 'ZIPCode_92626', 'ZIPCode_92630', 'ZIPCode_92634', 'ZIPCode_92646', 'ZIPCode_92647', 'Z
IPCode_92648', 'ZIPCode_92653', 'ZIPCode_92660', 'ZIPCode_92661', 'ZIPCode_92672', 'ZIPCode_92673', 'ZIPCode_92
675', 'ZIPCode_92677', 'ZIPCode_92691', 'ZIPCode_92692', 'ZIPCode_92694', 'ZIPCode_92697', 'ZIPCode_92703', 'ZI
PCode_92704', 'ZIPCode_92705', 'ZIPCode_92709', 'ZIPCode_92717', 'ZIPCode_92735', 'ZIPCode_92780', 'ZIPCode_928
06', 'ZIPCode_92807', 'ZIPCode_92821', 'ZIPCode_92831', 'ZIPCode_92833', 'ZIPCode_92834', 'ZIPCode_92835', 'ZIP
Code_92843', 'ZIPCode_92866', 'ZIPCode_92867', 'ZIPCode_92868', 'ZIPCode_92870', 'ZIPCode_92886', 'ZIPCode_9300
3', 'ZIPCode_93009', 'ZIPCode_93010', 'ZIPCode_93014', 'ZIPCode_93022', 'ZIPCode_93023', 'ZIPCode_93033', 'ZIPC
ode_93063', 'ZIPCode_93065', 'ZIPCode_93077', 'ZIPCode_93101', 'ZIPCode_93105', 'ZIPCode_93106', 'ZIPCode_93107
', 'ZIPCode_93108', 'ZIPCode_93109', 'ZIPCode_93111', 'ZIPCode_93117', 'ZIPCode_93118', 'ZIPCode_93302', 'ZIPCo
de_93305', 'ZIPCode_93311', 'ZIPCode_93401', 'ZIPCode_93403', 'ZIPCode_93407', 'ZIPCode_93437', 'ZIPCode_93460'
, 'ZIPCode_93524', 'ZIPCode_93555', 'ZIPCode_93561', 'ZIPCode_93611', 'ZIPCode_93657', 'ZIPCode_93711', 'ZIPCod
e_93720', 'ZIPCode_93727', 'ZIPCode_93907', 'ZIPCode_93933', 'ZIPCode_93940', 'ZIPCode_93943', 'ZIPCode_93950',
'ZIPCode_93955', 'ZIPCode_94002', 'ZIPCode_94005', 'ZIPCode_94010', 'ZIPCode_94015', 'ZIPCode_94019', 'ZIPCode_
94022', 'ZIPCode_94024', 'ZIPCode_94025', 'ZIPCode_94028', 'ZIPCode_94035', 'ZIPCode_94040', 'ZIPCode_94043', '
ZIPCode_94061', 'ZIPCode_94063', 'ZIPCode_94065', 'ZIPCode_94066', 'ZIPCode_94080', 'ZIPCode_94085', 'ZIPCode_9
4086', 'ZIPCode_94087', 'ZIPCode_94102', 'ZIPCode_94104', 'ZIPCode_94105', 'ZIPCode_94107', 'ZIPCode_94108', 'Z
IPCode_94109', 'ZIPCode_94110', 'ZIPCode_94111', 'ZIPCode_94112', 'ZIPCode_94114', 'ZIPCode_94115', 'ZIPCode_94
116', 'ZIPCode_94117', 'ZIPCode_94118', 'ZIPCode_94122', 'ZIPCode_94123', 'ZIPCode_94124', 'ZIPCode_94126',
'ZIPCode_94131', 'ZIPCode_94132', 'ZIPCode_94143', 'ZIPCode_94234', 'ZIPCode_94301', 'ZIPCode_94302',
'ZIPCode_94303', 'ZIPCode_94304', 'ZIPCode_94305', 'ZIPCode_94306', 'ZIPCode_94309', 'ZIPCode_94402',
'ZIPCode_94404', 'ZIPCode_94501', 'ZIPCode_94507', 'ZIPCode_94509', 'ZIPCode_94521', 'ZIPCode_94523',
'ZIPCode_94526', 'ZIPCode_94534', 'ZIPCode_94536', 'ZIPCode_94538', 'ZIPCode_94539', 'ZIPCode_94542',
'ZIPCode_94545', 'ZIPCode_94546', 'ZIPCode_94550', 'ZIPCode_94551', 'ZIPCode_94553', 'ZIPCode_94555',
'ZIPCode_94558', 'ZIPCode_94566', 'ZIPCode_94571', 'ZIPCode_94575', 'ZIPCode_94577', 'ZIPCode_94583',
'ZIPCode_94588', 'ZIPCode_94590', 'ZIPCode_94591', 'ZIPCode_94596', 'ZIPCode_94598', 'ZIPCode_94604',
'ZIPCode_94606', 'ZIPCode_94607', 'ZIPCode_94608', 'ZIPCode_94609', 'ZIPCode_94610', 'ZIPCode_94611',
'ZIPCode_94612', 'ZIPCode_94618', 'ZIPCode_94701', 'ZIPCode_94703', 'ZIPCode_94704', 'ZIPCode_94705',
'ZIPCode_94706', 'ZIPCode_94707', 'ZIPCode_94708', 'ZIPCode_94709', 'ZIPCode_94710', 'ZIPCode_94720',
'ZIPCode_94801', 'ZIPCode_94803', 'ZIPCode_94806', 'ZIPCode_94901', 'ZIPCode_94904', 'ZIPCode_94920',
'ZIPCode_94923', 'ZIPCode_94928', 'ZIPCode_94939', 'ZIPCode_94949', 'ZIPCode_94960', 'ZIPCode_94965',
'ZIPCode_94970', 'ZIPCode_94998', 'ZIPCode_95003', 'ZIPCode_95005', 'ZIPCode_95006', 'ZIPCode_95008',
'ZIPCode_95010', 'ZIPCode_95014', 'ZIPCode_95020', 'ZIPCode_95023', 'ZIPCode_95032', 'ZIPCode_95035',
'ZIPCode_95037', 'ZIPCode_95039', 'ZIPCode_95045', 'ZIPCode_95051', 'ZIPCode_95053', 'ZIPCode_95054',
'ZIPCode_95060', 'ZIPCode_95064', 'ZIPCode_95070', 'ZIPCode_95112', 'ZIPCode_95120', 'ZIPCode_95123',
'ZIPCode_95125', 'ZIPCode_95126', 'ZIPCode_95131', 'ZIPCode_95133', 'ZIPCode_95134', 'ZIPCode_95135',
'ZIPCode_95136', 'ZIPCode_95138', 'ZIPCode_95192', 'ZIPCode_95193', 'ZIPCode_95207', 'ZIPCode_95211',
'ZIPCode_95307', 'ZIPCode_95348', 'ZIPCode_95351', 'ZIPCode_95354', 'ZIPCode_95370', 'ZIPCode_95403',
'ZIPCode_95405', 'ZIPCode_95422', 'ZIPCode_95449', 'ZIPCode_95482', 'ZIPCode_95503', 'ZIPCode_95518',
'ZIPCode_95521', 'ZIPCode_95605', 'ZIPCode_95616', 'ZIPCode_95617', 'ZIPCode_95621', 'ZIPCode_95630',
'ZIPCode_95670', 'ZIPCode_95678', 'ZIPCode_95741', 'ZIPCode_95747', 'ZIPCode_95758', 'ZIPCode_95762',
'ZIPCode_95812', 'ZIPCode_95814', 'ZIPCode_95816', 'ZIPCode_95817', 'ZIPCode_95818', 'ZIPCode_95819',
'ZIPCode_95820', 'ZIPCode_95821', 'ZIPCode_95822', 'ZIPCode_95825', 'ZIPCode_95827', 'ZIPCode_95828',
'ZIPCode_95831', 'ZIPCode_95833', 'ZIPCode_95841', 'ZIPCode_95842', 'ZIPCode_95929', 'ZIPCode_95973',
'ZIPCode_96001', 'ZIPCode_96003', 'ZIPCode_96008', 'ZIPCode_96064', 'ZIPCode_96091', 'ZIPCode_96094',
'ZIPCode_96145', 'ZIPCode_96150', 'ZIPCode_96651', 'Education_2', 'Education_3']
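Dummy columns of this form are produced by one-hot encoding the categorical features. A minimal sketch of how such columns arise with pandas (toy values, not the actual campaign data):

```python
import pandas as pd

# Toy frame mimicking the notebook's categorical columns (values are made up).
df = pd.DataFrame(
    {
        "ZIPCode": ["94117", "94118", "94117"],
        "Education": [1, 2, 3],
    }
)

# One-hot encode with drop_first=True so the first level of each category is
# dropped as the baseline, which is why e.g. Education_1 is absent from the list above.
encoded = pd.get_dummies(df, columns=["ZIPCode", "Education"], drop_first=True)
print(list(encoded.columns))  # -> ['ZIPCode_94118', 'Education_2', 'Education_3']
```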
#Creates filled nodes, defines the font size, adds black edges around the arrows, and then displays the tree
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
#Prints a text representation of the decision tree model, including the names of the features used for each split
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
| |--- CCAvg <= 2.95
| | |--- Income <= 106.50
| | | |--- weights: [2553.00, 0.00] class: 0
| | |--- Income > 106.50
| | | |--- Family <= 3.50
| | | | |--- ZIPCode_90049 <= 0.50
| | | | | |--- ZIPCode_92007 <= 0.50
| | | | | | |--- ZIPCode_93106 <= 0.50
| | | | | | | |--- weights: [63.00, 0.00] class: 0
| | | | | | |--- ZIPCode_93106 > 0.50
| | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | |--- ZIPCode_92007 > 0.50
| | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | |--- ZIPCode_90049 > 0.50
| | | | | |--- weights: [0.00, 1.00] class: 1
| | | |--- Family > 3.50
| | | | |--- Age <= 32.50
| | | | | |--- CCAvg <= 2.40
| | | | | | |--- weights: [12.00, 0.00] class: 0
| | | | | |--- CCAvg > 2.40
| | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | |--- Age > 32.50
| | | | | |--- Age <= 60.00
| | | | | | |--- weights: [0.00, 6.00] class: 1
| | | | | |--- Age > 60.00
| | | | | | |--- weights: [4.00, 0.00] class: 0
| |--- CCAvg > 2.95
| | |--- Income <= 92.50
| | | |--- CD_Account <= 0.50
| | | | |--- ZIPCode_91360 <= 0.50
| | | | | |--- ZIPCode_92220 <= 0.50
| | | | | | |--- ZIPCode_94709 <= 0.50
| | | | | | | |--- ZIPCode_92521 <= 0.50
| | | | | | | | |--- ZIPCode_91203 <= 0.50
| | | | | | | | | |--- ZIPCode_94122 <= 0.50
| | | | | | | | | | |--- ZIPCode_94105 <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 5
| | | | | | | | | | |--- ZIPCode_94105 > 0.50
| | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | | | | | |--- ZIPCode_94122 > 0.50
| | | | | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | | | | |--- ZIPCode_91203 > 0.50
| | | | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | | | |--- ZIPCode_92521 > 0.50
| | | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | | |--- ZIPCode_94709 > 0.50
| | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | |--- ZIPCode_92220 > 0.50
| | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | |--- ZIPCode_91360 > 0.50
| | | | | |--- weights: [0.00, 1.00] class: 1
| | | |--- CD_Account > 0.50
| | | | |--- weights: [0.00, 5.00] class: 1
| | |--- Income > 92.50
| | | |--- Family <= 2.50
| | | | |--- Education_2 <= 0.50
| | | | | |--- Education_3 <= 0.50
| | | | | | |--- CD_Account <= 0.50
| | | | | | | |--- ZIPCode_90034 <= 0.50
| | | | | | | | |--- weights: [28.00, 0.00] class: 0
| | | | | | | |--- ZIPCode_90034 > 0.50
| | | | | | | | |--- Income <= 103.50
| | | | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | | | | |--- Income > 103.50
| | | | | | | | | |--- weights: [1.00, 0.00] class: 0
| | | | | | |--- CD_Account > 0.50
| | | | | | | |--- CCAvg <= 4.75
| | | | | | | | |--- weights: [0.00, 2.00] class: 1
| | | | | | | |--- CCAvg > 4.75
| | | | | | | | |--- weights: [1.00, 0.00] class: 0
| | | | | |--- Education_3 > 0.50
| | | | | | |--- CCAvg <= 3.95
| | | | | | | |--- ZIPCode_90277 <= 0.50
| | | | | | | | |--- weights: [0.00, 5.00] class: 1
| | | | | | | |--- ZIPCode_90277 > 0.50
| | | | | | | | |--- weights: [1.00, 0.00] class: 0
| | | | | | |--- CCAvg > 3.95
| | | | | | | |--- Income <= 107.00
| | | | | | | | |--- weights: [6.00, 0.00] class: 0
| | | | | | | |--- Income > 107.00
| | | | | | | | |--- weights: [0.00, 2.00] class: 1
| | | | |--- Education_2 > 0.50
| | | | | |--- weights: [0.00, 4.00] class: 1
| | | |--- Family > 2.50
| | | | |--- Age <= 57.50
| | | | | |--- ZIPCode_90245 <= 0.50
| | | | | | |--- weights: [0.00, 20.00] class: 1
| | | | | |--- ZIPCode_90245 > 0.50
| | | | | | |--- weights: [1.00, 0.00] class: 0
| | | | |--- Age > 57.50
| | | | | |--- Income <= 97.50
| | | | | | |--- weights: [0.00, 2.00] class: 1
| | | | | |--- Income > 97.50
| | | | | | |--- ZIPCode_94606 <= 0.50
| | | | | | | |--- weights: [7.00, 0.00] class: 0
| | | | | | |--- ZIPCode_94606 > 0.50
| | | | | | | |--- weights: [0.00, 1.00] class: 1
|--- Income > 116.50
| |--- Family <= 2.50
| | |--- Education_3 <= 0.50
| | | |--- Education_2 <= 0.50
| | | | |--- weights: [375.00, 0.00] class: 0
| | | |--- Education_2 > 0.50
| | | | |--- weights: [0.00, 53.00] class: 1
| | |--- Education_3 > 0.50
| | | |--- weights: [0.00, 62.00] class: 1
| |--- Family > 2.50
| | |--- weights: [0.00, 154.00] class: 1
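The rule listing above is the output of sklearn's `tree.export_text`. On a toy one-feature dataset (values invented), the same call produces an analogous dump:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# One invented feature; the classes separate at Income = 2.5.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [0, 0, 1, 1]

toy = DecisionTreeClassifier(random_state=1).fit(X, y)

# show_weights=True prints the per-class sample weights at each leaf,
# which is where the "weights: [...]" entries in the dump above come from.
txt = export_text(toy, feature_names=["Income"], show_weights=True)
print(txt)
```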
#Displays the importance score of each feature used in the decision tree model, sorted in descending order of importance
print(
    pd.DataFrame(
        model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

                    Imp
Income         0.308577
Family         0.246862
Education_2    0.165238
Education_3    0.144207
CCAvg          0.048662
...                 ...
ZIPCode_92110  0.000000
ZIPCode_92109  0.000000
ZIPCode_92106  0.000000
ZIPCode_92104  0.000000
ZIPCode_93009  0.000000

[477 rows x 1 columns]
#Creates a horizontal bar chart that visualizes the relative importance of each feature used in the decision tree model
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
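For context on what the bar chart encodes: sklearn's impurity-based `feature_importances_` are normalized to sum to 1, so each bar is a relative share of the total impurity reduction. A toy sketch (synthetic data, not the bank data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (invented): feature 0 separates the classes perfectly, feature 1 is noise.
X = np.array([[0, 5], [0, 7], [1, 6], [1, 8]])
y = np.array([0, 0, 1, 1])

toy = DecisionTreeClassifier(random_state=1).fit(X, y)

# All impurity reduction is credited to feature 0, and the scores sum to 1.
print(toy.feature_importances_)  # -> [1. 0.]
```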
Checking model performance on test data
#Evaluates the performance of the previously trained model using the test dataset
#and displays the confusion matrix and performance metrics: accuracy, recall, precision, and F1 score.
model_performance_and_confusion_matrix(model, X_test, y_test)

   Accuracy    Recall  Precision        F1
0     0.984  0.879195   0.956204  0.916084
#Stores the performance metrics and confusion matrix of the decision tree model evaluated on the test set, then displays them
decision_tree_perf_test = model_performance_and_confusion_matrix(model, X_test, y_test)
decision_tree_perf_test

   Accuracy    Recall  Precision        F1
0     0.984  0.879195   0.956204  0.916084
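`model_performance_and_confusion_matrix` is a helper defined earlier in the notebook; its implementation is not shown in this excerpt. A plausible minimal sketch that would produce one-row metric tables like the one above (the name suffix and exact layout are assumptions, not the notebook's actual code):

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.tree import DecisionTreeClassifier

def model_performance_and_confusion_matrix_sketch(model, X, y):
    """Print the confusion matrix and return a one-row metrics DataFrame."""
    pred = model.predict(X)
    print(confusion_matrix(y, pred))
    return pd.DataFrame(
        {
            "Accuracy": [accuracy_score(y, pred)],
            "Recall": [recall_score(y, pred)],
            "Precision": [precision_score(y, pred)],
            "F1": [f1_score(y, pred)],
        }
    )

# Tiny demonstration on a trivially separable toy set.
clf = DecisionTreeClassifier(random_state=1).fit([[0], [1]], [0, 1])
perf = model_performance_and_confusion_matrix_sketch(clf, [[0], [1]], [0, 1])
print(perf)
```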
Pre-pruning
#Sets up a grid search with cross-validation to find the best hyperparameters
#for a Decision Tree classifier using the training data; it then fits the classifier
#with the best parameters to the training data
estimator = DecisionTreeClassifier(random_state=1)
parameters = {
    "max_depth": np.arange(6, 15),
    "min_samples_leaf": [1, 2, 5, 7, 10],
    "max_leaf_nodes": [2, 3, 5, 10],
}
acc_scorer = make_scorer(recall_score)
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
estimator = grid_obj.best_estimator_
estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=10, min_samples_leaf=10,
                       random_state=1)
#Checks the performance on the training data
model_performance_and_confusion_matrix(model, X_train, y_train)
#Checks performance on training data and stores it
decision_tree_tune_perf_train = model_performance_and_confusion_matrix(model, X_train, y_train)

   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
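The grid-search pattern used above, condensed to a self-contained sketch on synthetic data (the real search runs on `X_train`/`y_train` with the parameter grid shown; the toy dataset and the smaller grid here are stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the bank data (~10% positives).
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [3, 5], "min_samples_leaf": [5, 10]},
    scoring=make_scorer(recall_score),  # optimize recall, not the default accuracy
    cv=5,
)
grid.fit(X, y)

# best_estimator_ is refit on the full training data with the winning parameters.
print(grid.best_params_, round(grid.best_score_, 3))
```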
decision_tree_tune_perf_train

   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Visualizing the Decision Tree
#Plots the decision tree of the best estimator found in the grid search; the bottom part adds black arrows
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
#Prints the structure of the optimized decision tree model in a text format, including the feature names and node weights
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
| |--- CCAvg <= 2.95
| | |--- Income <= 106.50
| | | |--- weights: [2553.00, 0.00] class: 0
| | |--- Income > 106.50
| | | |--- weights: [79.00, 10.00] class: 0
| |--- CCAvg > 2.95
| | |--- Income <= 92.50
| | | |--- weights: [117.00, 15.00] class: 0
| | |--- Income > 92.50
| | | |--- Family <= 2.50
| | | | |--- weights: [37.00, 14.00] class: 0
| | | |--- Family > 2.50
| | | | |--- Age <= 57.50
| | | | | |--- weights: [1.00, 20.00] class: 1
| | | | |--- Age > 57.50
| | | | | |--- weights: [7.00, 3.00] class: 0
|--- Income > 116.50
| |--- Family <= 2.50
| | |--- Education_3 <= 0.50
| | | |--- Education_2 <= 0.50
| | | | |--- weights: [375.00, 0.00] class: 0
| | | |--- Education_2 > 0.50
| | | | |--- weights: [0.00, 53.00] class: 1
| | |--- Education_3 > 0.50
| | | |--- weights: [0.00, 62.00] class: 1
| |--- Family > 2.50
| | |--- weights: [0.00, 154.00] class: 1
#Displays the feature importances of the optimized decision tree model as a DataFrame, sorted in descending order
print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

                    Imp
Income         0.337681
Family         0.275581
Education_2    0.175687
Education_3    0.157286
CCAvg          0.042856
...                 ...
ZIPCode_92103  0.000000
ZIPCode_92101  0.000000
ZIPCode_92096  0.000000
ZIPCode_92093  0.000000
ZIPCode_93009  0.000000

[477 rows x 1 columns]
#Plots a horizontal bar chart to visualize the relative importance of each feature used in the optimized decision tree
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
#Checks performance on test data
model_performance_and_confusion_matrix(model, X_test, y_test)

   Accuracy    Recall  Precision        F1
0     0.984  0.879195   0.956204  0.916084
#Checks performance on test data and stores the result
decision_tree_tune_post_test = model_performance_and_confusion_matrix(model, X_test, y_test)
decision_tree_tune_post_test
#Uses the cost complexity pruning path method to determine the effective alphas
#and the corresponding total impurities at each step for a decision tree classifier trained on the training data
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
#creates a DataFrame using the values of the "path" variable, that contains the alpha values
#and impurities from the cost-complexity pruning path of a decision tree.
pd.DataFrame(path)
    ccp_alphas  impurities
0     0.000000    0.000000
1     0.000276    0.000552
2     0.000279    0.002224
3     0.000381    0.002605
4     0.000476    0.003081
5     0.000500    0.003581
6     0.000513    0.007174
7     0.000527    0.007701
8     0.000544    0.008246
9     0.000545    0.009882
10    0.000625    0.010507
11    0.000700    0.011207
12    0.000762    0.012731
13    0.000882    0.016260
14    0.000940    0.017200
15    0.001305    0.018505
16    0.001647    0.020153
17    0.002333    0.022486
18    0.002407    0.024893
19    0.003294    0.028187
20    0.006473    0.034659
21    0.025146    0.084951
22    0.039216    0.124167
23    0.047088    0.171255
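The monotone structure of this table is general: `cost_complexity_pruning_path` returns the effective alphas in increasing order, and total leaf impurity rises as alpha grows. A self-contained check on a stock sklearn dataset (used only as a stand-in for the bank data):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)

# Larger alpha prunes more aggressively, so both sequences are non-decreasing.
assert np.all(np.diff(path.ccp_alphas) >= 0)
assert np.all(np.diff(path.impurities) >= 0)
print(len(path.ccp_alphas))
```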
#Plots the total impurity of leaves against the effective alpha values for the training set,
#extracted from the cost-complexity pruning path of a decision tree, to visualize the impact of different alphas
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
#Next we train decision trees using the effective alphas. This creates a series of decision
#trees, each with a different ccp_alpha value from the ccp_alphas array, fits them to the training data,
#and then prints the number of nodes in the last tree and its corresponding ccp_alpha value
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04708834100596766
#Plots two graphs: one showing the relationship between the ccp_alpha values
#and the number of nodes in the decision tree, and the other showing the relationship
#between the ccp_alpha values and the depth of the decision tree.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
#Calculates the recall score on both the training and test datasets for each decision tree model along the pruning path
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
#Plots the recall scores of the training and testing datasets against different
#values of alpha to visualize how the recall metric varies with the complexity of
#the decision tree (controlled by alpha)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
#Identifies and prints the best decision tree model, the one that yields the highest
#recall score on the test data, from the list of models trained with different alpha values
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(random_state=1)
Post-Pruning
#Defines and trains a Decision Tree classifier using the chosen ccp_alpha value (0.04708834100596766)
#and class weights that emphasize the positive class
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=0.04708834100596766, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
estimator_2.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.04708834100596766,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
#Checks the performance metrics and confusion matrix of the estimator_2 decision tree model using the training data
model_performance_and_confusion_matrix(estimator_2, X_train, y_train)
#Calculates and stores the performance metrics and confusion matrix of the estimator_2 decision tree
#using the training data in the decision_tree_tune_post_train variable, and then displays it
decision_tree_tune_post_train = model_performance_and_confusion_matrix(estimator_2, X_train, y_train)
decision_tree_tune_post_train

   Accuracy    Recall  Precision        F1
0  0.836286  0.933535   0.359302  0.518892
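The fractional node weights that `class_weight={0: 0.15, 1: 0.85}` produces can be seen on a toy example: each class-0 sample contributes 0.15 and each class-1 sample 0.85 to the node weight totals (data invented):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([0, 0, 0, 1])

clf = DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
clf.fit(X, y)

# Three class-0 samples weigh 3 * 0.15 = 0.45; the one class-1 sample weighs 0.85.
txt = export_text(clf, feature_names=["Income"], show_weights=True)
print(txt)
```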
Visualizing the Decision Tree
#Generates a visual representation of the estimator_2 decision tree model, including arrows indicating the splits
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator_2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
#Prints a text-based representation of the estimator_2 decision tree model, including feature names and node weights
print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
| |--- weights: [392.70, 18.70] class: 0
|--- Income > 98.50
| |--- weights: [82.65, 262.65] class: 1
#Shows the feature importances of the estimator_2 decision tree model, sorted in descending order of importance
print(
    pd.DataFrame(
        estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

               Imp
Income         1.0
ZIPCode_94123  0.0
ZIPCode_94306  0.0
ZIPCode_94305  0.0
ZIPCode_94304  0.0
...            ...
ZIPCode_92069  0.0
ZIPCode_92068  0.0
ZIPCode_92064  0.0
ZIPCode_92056  0.0
Education_3    0.0

[477 rows x 1 columns]
#Creates a horizontal bar plot to visualize the relative importance of features in the
#estimator_2 decision tree model, with features sorted by importance.
#The bars are colored green for better visualization.
importances = estimator_2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
#Evaluates the performance of the estimator_2 model on the test data
#and generates a confusion matrix along with various classification metrics.
model_performance_and_confusion_matrix(estimator_2, X_test, y_test)

   Accuracy   Recall  Precision        F1
0  0.823333  0.90604   0.349741  0.504673
#Evaluates the performance of the estimator_2 model on the test data,
#generates a confusion matrix along with various classification metrics, and then displays the results.
decision_tree_tune_post_test = model_performance_and_confusion_matrix(estimator_2, X_test, y_test)
decision_tree_tune_post_test

   Accuracy   Recall  Precision        F1
0  0.823333  0.90604   0.349741  0.504673
Model Comparison and Final Model Selection
#Creates a DataFrame to compare the training performance metrics of three decision tree models:
#one without pruning, one with pre-pruning, and one with post-pruning
models_train_comp_df = pd.concat(
    [decision_tree_perf_train.T, decision_tree_tune_perf_train.T, decision_tree_tune_post_train.T],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:

           Decision Tree sklearn  Decision Tree (Pre-Pruning)  Decision Tree (Post-Pruning)
Accuracy                     1.0                          1.0                      0.836286
Recall                       1.0                          1.0                      0.933535
Precision                    1.0                          1.0                      0.359302
F1                           1.0                          1.0                      0.518892
#Computes and compares the performance metrics of the two tuned decision tree models on the test data,
#then combines the results into a DataFrame for comparison
perf_metrics_1 = model_performance_and_confusion_matrix(estimator, X_test, y_test)
perf_metrics_2 = model_performance_and_confusion_matrix(estimator_2, X_test, y_test)
perf_metrics_combined = pd.concat([perf_metrics_1, perf_metrics_2], axis=0)
perf_metrics_combined['Setup'] = ['Setup 1', 'Setup 2']
cols = ['Setup'] + [col for col in perf_metrics_combined if col != 'Setup']
perf_metrics_combined = perf_metrics_combined[cols]
print(perf_metrics_combined)

     Setup  Accuracy    Recall  Precision        F1
0  Setup 1  0.978667  0.785235   1.000000  0.879699
0  Setup 2  0.823333  0.906040   0.349741  0.504673
Actionable Insights and Business Recommendations
Questions:
1. What is the distribution of the mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
The majority of the mortgage values are clustered under $200,000, so the distribution is right-skewed.
2. How many customers have credit cards?
3,700 out of 5,000 customers own credit cards.
3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
The characteristics with the highest correlation to personal loan acceptance are:
Credit Card ownership, with a correlation coefficient of 0.45
Family size, with a correlation coefficient of 0.42
Holding a securities account, with a correlation coefficient of 0.41
4. How does a customer's interest in purchasing a loan vary with their age?
There is a decline in the inclination to pursue a loan as age increases. The age group displaying the greatest willingness to accept
loans is between 30 and 50 years old.
5. How does a customer's interest in purchasing a loan vary with their education?
A rising trend in loan acceptance is noted with increasing education. Customers holding graduate degrees are the most likely to
accept loans, with an acceptance rate of 15.8%. This is followed by those with undergraduate degrees at a 9.9% acceptance rate,
and individuals with advanced or professional degrees, with a 7.3% acceptance rate.
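Answers of this kind can be reproduced with short pandas expressions. A sketch on a hypothetical mini-frame (column names such as `CreditCard` are assumed from the data dictionary, and the values are invented; the real figures above come from the full 5,000-row dataset):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "CreditCard": [1, 0, 1, 1],
        "Education": [1, 2, 3, 2],
        "Personal_Loan": [0, 1, 1, 0],
    }
)

# Count of credit-card holders (Q2-style count).
n_cc = int(df["CreditCard"].sum())

# Loan-acceptance rate by education level (Q5-style breakdown).
rate_by_edu = df.groupby("Education")["Personal_Loan"].mean()

# Correlation of each attribute with the target (Q3-style ranking).
corr = df.corr()["Personal_Loan"].drop("Personal_Loan")
print(n_cc, rate_by_edu.to_dict(), corr.round(2).to_dict())
```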
What recommendations would you suggest to the bank?
Create customized loan products tailored to individuals holding graduate or professional degrees, as they are likely to have a higher
conversion rate. Additionally, target existing customers who hold securities and CD accounts with personalized personal loan
offerings through cross-selling initiatives.
Develop marketing campaigns that are uniquely aimed at younger individuals with higher income levels, with a strong emphasis on
crafting specialized financial products that align with their financial goals and ambitions. Income and age appear to be more
influential factors compared to other variables considered.
Utilize the pruned decision tree model to pinpoint and focus on customer profiles with the most potential, particularly those
demonstrating strong recall performance, as part of your targeting strategy.